Abstract


Deep dynamic generative models are developed to learn sequential dependencies 
in time-series data. The multi-layered model is designed by constructing a hierarchy
of temporal sigmoid belief networks (TSBNs), defined as a sequential stack
of sigmoid belief networks (SBNs). Each SBN has a contextual hidden state,
inherited from the previous SBNs in the sequence, and is used to regulate its hidden
bias. Scalable learning and inference algorithms are derived by introducing
a recognition model that yields fast sampling from the variational posterior. This 
recognition model is trained jointly with the generative model, by maximizing its 
variational lower bound on the log-likelihood. Experimental results on bouncing 
balls, polyphonic music, motion capture, and text streams show that the proposed 
approach achieves state-of-the-art predictive performance, and has the capacity to 
synthesize various sequences. 


1 Introduction 

Considerable research has been devoted to developing probabilistic models for high-dimensional
time-series data, such as video and music sequences, motion capture data, and text streams. Among
them, Hidden Markov Models (HMMs) [1] and Linear Dynamical Systems (LDS) [2] have been
widely studied, but they may be limited in the type of dynamical structures they can model. An
HMM is a mixture model, which relies on a single multinomial variable to represent the history of a
time-series. To represent $N$ bits of information about the history, an HMM could require $2^N$ distinct
states. On the other hand, real-world sequential data often contain complex non-linear temporal
dependencies, while an LDS can only model simple linear dynamics.

Another class of time-series models, which are potentially better suited to model complex probability
distributions over high-dimensional sequences, relies on the use of Recurrent Neural Networks
(RNNs) [3, 4, 5, 6], and variants of a well-known undirected graphical model called the Restricted
Boltzmann Machine (RBM) [7, 8, 9, 10, 11]. One such variant is the Temporal Restricted Boltzmann
Machine (TRBM) [8], which consists of a sequence of RBMs, where the states of one or more
previous RBMs determine the biases of the RBM in the current time step. Learning and inference in
the TRBM are non-trivial. The approximate procedure used in [8] is heuristic and not derived from a
principled statistical formalism.

Recently, deep directed generative models [12, 13, 14, 15] have become popular. A directed graphical
model that is closely related to the RBM is the Sigmoid Belief Network (SBN) [16]. In the work
presented here, we introduce the Temporal Sigmoid Belief Network (TSBN), which can be viewed
as a temporal stack of SBNs, where each SBN has a contextual hidden state that is inherited from
the previous SBNs and is used to adjust its hidden-units bias. Based on this, we further develop
a deep dynamic generative model by constructing a hierarchy of TSBNs. This can be considered
as a deep SBN [15] with temporal feedback loops on each layer. Both stochastic and deterministic
hidden layers are considered.

Figure 1: Graphical model for the Deep Temporal Sigmoid Belief Network. (a,b) Generative and recognition
model of the TSBN. (c,d) Generative and recognition model of a two-layer Deep TSBN.

Compared with previous work, our model: (i) can be viewed as a generalization of an HMM with
distributed hidden state representations, and with a deep architecture; (ii) can be seen as a generalization
of an LDS with complex non-linear dynamics; (iii) can be considered as a probabilistic
construction of the traditionally deterministic RNN; (iv) is closely related to the TRBM, but it has a
fully generative process, where data are readily generated from the model using ancestral sampling;
(v) can be utilized to model different kinds of data, e.g., binary, real-valued and counts.

The “explaining away” effect described in [17] makes inference slow if one uses traditional inference
methods. Another important contribution we present here is to develop fast and scalable
learning and inference algorithms, by introducing a recognition model [12, 13, 14] that learns an
inverse mapping from observations to hidden variables, based on a loss function derived from a variational
principle. By utilizing the recognition model and variance-reduction techniques from [13],
we achieve fast inference at both training and testing time.

2 Model Formulation 

2.1 Sigmoid Belief Networks 

Deep dynamic generative models are considered, based on the Sigmoid Belief Network (SBN) [16].
An SBN is a Bayesian network that models a binary visible vector $v \in \{0,1\}^M$, in terms of binary
hidden variables $h \in \{0,1\}^J$ and weights $W \in \mathbb{R}^{M \times J}$ with

$p(v_m = 1 \mid h) = \sigma(w_m^\top h + c_m), \quad p(h_j = 1) = \sigma(b_j)$,   (1)

where $v = [v_1, \ldots, v_M]^\top$, $h = [h_1, \ldots, h_J]^\top$, $W = [w_1, \ldots, w_M]^\top$, $c = [c_1, \ldots, c_M]^\top$,
$b = [b_1, \ldots, b_J]^\top$, and the logistic function is $\sigma(x) = 1/(1 + e^{-x})$. The parameters $W$, $b$ and $c$
characterize all data, and the hidden variables, $h$, are specific to particular visible data, $v$.
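The generative process in (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sigmoid` and `sample_sbn` and the toy weights are assumptions made for the example and are not part of the released code.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_sbn(W, b, c, rng):
    """Draw one (h, v) pair from an SBN by ancestral sampling.

    W is an M x J list of rows w_m, b has length J, c has length M.
    """
    # Hidden units are independent a priori: p(h_j = 1) = sigmoid(b_j).
    h = [1 if rng.random() < sigmoid(bj) else 0 for bj in b]
    # Visible units are conditionally independent given h.
    v = []
    for w_m, c_m in zip(W, c):
        p = sigmoid(sum(w * hj for w, hj in zip(w_m, h)) + c_m)
        v.append(1 if rng.random() < p else 0)
    return h, v


rng = random.Random(0)
W = [[2.0, -1.0], [0.5, 0.5], [-3.0, 1.0]]  # toy values: M = 3, J = 2
h, v = sample_sbn(W, b=[0.0, 0.0], c=[0.0, 0.0, 0.0], rng=rng)
```

Sampling proceeds top-down (hidden first, then visible), which is what makes the model directed; the reverse direction, inferring $h$ from $v$, has no such closed form.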

The SBN is closely related to the RBM [18], which is a Markov random field with the same bipartite
structure as the SBN. The RBM defines a distribution over a binary vector that is proportional
to the exponential of its energy, defined as $-E(v, h) = v^\top c + v^\top W h + h^\top b$. The conditional
distributions, $p(v \mid h)$ and $p(h \mid v)$, in the RBM are factorial, which makes inference fast, while parameter
estimation usually relies on an approximation technique known as Contrastive Divergence
(CD) [18].

The energy function of an SBN may be written as $-E(v, h) = v^\top c + v^\top W h + h^\top b - \sum_m \log(1 +
\exp(w_m^\top h + c_m))$. SBNs explicitly manifest the generative process to obtain data, in which the
hidden layer provides a directed “explanation” for patterns generated in the visible layer. However,
the “explaining away” effect described in [17] makes inference inefficient; the latter can be alleviated
by exploiting recent advances in variational inference methods [13].

2.2 Temporal Sigmoid Belief Networks

The proposed Temporal Sigmoid Belief Network (TSBN) model is a sequence of SBNs arranged
in such a way that at any given time step, the SBN's biases depend on the state of the SBNs in the
previous time steps. Specifically, assume we have a length-$T$ binary visible sequence, the $t$th time
step of which is denoted $v_t \in \{0,1\}^M$. The TSBN describes the joint probability as

$p_\theta(V, H) = p(h_1)\, p(v_1 \mid h_1) \cdot \prod_{t=2}^{T} p(h_t \mid h_{t-1}, v_{t-1}) \cdot p(v_t \mid h_t, v_{t-1})$,   (2)

where $V = [v_1, \ldots, v_T]$, $H = [h_1, \ldots, h_T]$, and each $h_t \in \{0,1\}^J$ represents the hidden state
corresponding to time step $t$. For $t = 1, \ldots, T$, each conditional distribution in (2) is expressed as

$p(h_{jt} = 1 \mid h_{t-1}, v_{t-1}) = \sigma(w_{1j}^\top h_{t-1} + w_{3j}^\top v_{t-1} + b_j)$,   (3)

$p(v_{mt} = 1 \mid h_t, v_{t-1}) = \sigma(w_{2m}^\top h_t + w_{4m}^\top v_{t-1} + c_m)$,   (4)

where $h_0$ and $v_0$, needed for the prior model $p(h_1)$ and $p(v_1 \mid h_1)$, are defined as zero vectors
for conciseness. The model parameters, $\theta$, are specified as $W_1 \in \mathbb{R}^{J \times J}$, $W_2 \in \mathbb{R}^{M \times J}$,
$W_3 \in \mathbb{R}^{J \times M}$, $W_4 \in \mathbb{R}^{M \times M}$. For $i = 1, 2, 3, 4$, $w_{ij}$ is the transpose of the $j$th row of $W_i$,
and $c = [c_1, \ldots, c_M]^\top$ and $b = [b_1, \ldots, b_J]^\top$ are bias terms. The graphical model for the TSBN is
shown in Figure 1(a).

By setting $W_3$ and $W_4$ to be zero matrices, the TSBN can be viewed as a Hidden Markov Model
[1] with an exponentially large state space, that has a compact parameterization of the transition and
the emission probabilities. Specifically, each hidden state in the HMM is represented as a one-hot
length-$J$ vector, while in the TSBN, the hidden states can be any length-$J$ binary vector. We note
that the transition matrix is highly structured, since the number of parameters is only quadratic w.r.t.
$J$. Compared with the TRBM [8], our TSBN is fully directed, which allows for fast sampling of
“fantasy” data from the inferred model.
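Because the TSBN is fully directed, a sequence can be drawn by following (2)-(4) forward in time. The sketch below illustrates this with plain Python lists; the helper names (`matvec`, `bernoulli`, `sample_tsbn`, `zeros`) and the random toy weights are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def bernoulli(probs, rng):
    """Sample independent binary units with the given probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def sample_tsbn(W1, W2, W3, W4, b, c, T, rng):
    """Ancestral sampling of (H, V) following Eqs. (2)-(4).

    Shapes: W1 is J x J, W2 is M x J, W3 is J x M, W4 is M x M.
    """
    J, M = len(b), len(c)
    h_prev, v_prev = [0] * J, [0] * M  # h_0 and v_0 are zero vectors
    H, V = [], []
    for _ in range(T):
        # Eq. (3): hidden units given the previous hidden and visible states.
        pre_h = [x + y + bj for x, y, bj in
                 zip(matvec(W1, h_prev), matvec(W3, v_prev), b)]
        h = bernoulli([sigmoid(a) for a in pre_h], rng)
        # Eq. (4): visible units given the current hidden state and previous frame.
        pre_v = [x + y + cm for x, y, cm in
                 zip(matvec(W2, h), matvec(W4, v_prev), c)]
        v = bernoulli([sigmoid(a) for a in pre_v], rng)
        H.append(h)
        V.append(v)
        h_prev, v_prev = h, v
    return H, V


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


J, M, T = 3, 4, 5
rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in range(J)] for _ in range(J)]
W2 = [[rng.gauss(0, 1) for _ in range(J)] for _ in range(M)]
# Setting W3 = W4 = 0 gives the HMM-like special case discussed above.
H, V = sample_tsbn(W1, W2, zeros(J, M), zeros(M, M), [0.0] * J, [0.0] * M, T, rng)
```

The number of free parameters in this example is governed by the four weight matrices and two bias vectors, even though the hidden state can take $2^J$ configurations.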


2.3 TSBN Variants 


Modeling real-valued data  The model above can be readily extended to model real-valued sequence
data, by substituting (4) with $p(v_t \mid h_t, v_{t-1}) = \mathcal{N}(\mu_t, \mathrm{diag}(\sigma_t^2))$, where

$\mu_{mt} = w_{2m}^\top h_t + w_{4m}^\top v_{t-1} + c_m, \quad \log \sigma_{mt}^2 = (w'_{2m})^\top h_t + (w'_{4m})^\top v_{t-1} + c'_m$,   (5)

and $\mu_{mt}$ and $\sigma_{mt}^2$ are elements of $\mu_t$ and $\sigma_t^2$, respectively. $W'_2$ and $W'_4$ are of the same size as
$W_2$ and $W_4$, respectively. Compared with the Gaussian TRBM [9], in which $\sigma_{mt}$ is fixed to 1, our
formalism uses a diagonal matrix to parameterize the variance structure of $v_t$.


Modeling count data  We also introduce an approach for modeling time-series data with count
observations, by replacing (4) with $p(v_t \mid h_t, v_{t-1}) = \prod_{m=1}^{M} y_{mt}^{v_{mt}}$, where

$y_{mt} = \frac{\exp(w_{2m}^\top h_t + w_{4m}^\top v_{t-1} + c_m)}{\sum_{m'=1}^{M} \exp(w_{2m'}^\top h_t + w_{4m'}^\top v_{t-1} + c_{m'})}$.   (6)

This formulation is related to the Replicated Softmax Model (RSM) described in [19]; however, our
approach uses a directed connection from the binary hidden variables to the visible counts, while
also learning the dynamics in the count sequences.
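As a small illustration of the count model, the sketch below computes the softmax probabilities $y_{mt}$ and the resulting log-likelihood of a count vector. The function names and toy numbers are assumptions for the example; subtracting the maximum activation before exponentiating is a standard stabilization step added here, not something stated in the text.

```python
import math


def count_emission_probs(W2, W4, c, h, v_prev):
    """Return y_t, the softmax over the M activations of the count model."""
    acts = [
        sum(w * hj for w, hj in zip(w2m, h))
        + sum(w * vp for w, vp in zip(w4m, v_prev))
        + cm
        for w2m, w4m, cm in zip(W2, W4, c)
    ]
    top = max(acts)  # subtract the maximum for numerical stability
    exps = [math.exp(a - top) for a in acts]
    total = sum(exps)
    return [e / total for e in exps]


def count_log_likelihood(v, y):
    """log prod_m y_m^{v_m} = sum_m v_m log y_m for a count vector v."""
    return sum(vm * math.log(ym) for vm, ym in zip(v, y) if vm > 0)


W2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # toy values: M = 3 words, J = 2 units
W4 = [[0.0] * 3 for _ in range(3)]         # no visible-to-visible weights here
y = count_emission_probs(W2, W4, c=[0.0, 0.0, 0.0], h=[1, 0], v_prev=[2, 0, 1])
ll = count_log_likelihood([3, 0, 1], y)
```

Only the hidden unit that is switched on raises the probability of the word it connects to; the other two words share the remaining mass equally.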


Furthermore, rather than assuming that ht and Vt only depend on ht-i and Vt-i, in the experiments, 
we also allow for connections from the past n time steps of the hidden and visible states, to the 
current states, ht and Vt. A sliding window is then used to go through the sequence to obtain n 
frames at each time. We refer to n as the order of the model. 
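A sliding window of order $n$ can be sketched as follows. The function name `order_n_contexts` is hypothetical, and zero-padding the first $n$ positions is an assumption made here by analogy with the zero-vector convention for $h_0$ and $v_0$.

```python
def order_n_contexts(frames, n):
    """Pair each frame with its n predecessors, zero-padding before t = 1."""
    dim = len(frames[0])
    padded = [[0] * dim for _ in range(n)] + list(frames)
    # padded[t:t + n] holds the n frames that precede frames[t].
    return [(padded[t:t + n], frames[t]) for t in range(len(frames))]


frames = [[1, 0], [0, 1], [1, 1], [0, 0]]
contexts = order_n_contexts(frames, n=2)
```

Each element of `contexts` is a `(past, current)` pair, so an order-$n$ model would condition the current frame on the concatenated `past` frames.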


2.4 Deep Architecture for Sequence Modeling with TSBNs 

Learning the sequential dependencies with the shallow model in (2)-(4) may be restrictive. Therefore,
we propose two deep architectures to improve its representational power: (i) adding stochastic
hidden layers; (ii) adding deterministic hidden layers. The graphical model for the deep TSBN
is shown in Figure 1(c). Specifically, we consider a deep TSBN with hidden layers $h_t^{(\ell)}$ for
$t = 1, \ldots, T$ and $\ell = 1, \ldots, L$. Assume layer $\ell$ contains $J^{(\ell)}$ hidden units, and denote the visible
layer $v_t = h_t^{(0)}$ and let $h_t^{(L+1)} = 0$, for convenience. In order to obtain a proper generative
model, the top hidden layer contains stochastic binary hidden variables.

For the middle layers, $\ell = 1, \ldots, L-1$, if stochastic hidden layers are utilized, the generative process
is expressed as $p(h_t^{(\ell)}) = \prod_{j=1}^{J^{(\ell)}} p(h_{jt}^{(\ell)} \mid h_t^{(\ell+1)}, h_{t-1}^{(\ell)}, h_{t-1}^{(\ell-1)})$, where each conditional distribution
is parameterized via a logistic function, as in (4). If deterministic hidden layers are employed,
we obtain $h_t^{(\ell)} = f(h_t^{(\ell+1)}, h_{t-1}^{(\ell)}, h_{t-1}^{(\ell-1)})$, where $f(\cdot)$ is chosen to be a rectified linear function.
Although the differences between these two approaches are minor, learning and inference algorithms
can be quite different, as shown in Section 3.3.


3 Scalable Learning and Inference 

Computation of the exact posterior over the hidden variables in (2) is intractable. Approximate
Bayesian inference, such as Gibbs sampling or mean-field variational Bayes (VB) inference, can 
be implemented CSlIIll. However, Gibbs sampling is very inefficient, due to the fact that the 
conditional posterior distribution of the hidden variables does not factorize. The mean-field VB 
indeed provides a fully factored variational posterior, but this technique increases the gap between 
the bound being optimized and the true log-likelihood, potentially resulting in a poor fit to the data. 
To allow for tractable and scalable inference and parameter learning, without loss of the flexibility of 
the variational posterior, we apply the Neural Variational Inference and Learning (NVIL) algorithm 
described in [13].

3.1 Variational Lower Bound Objective 

We are interested in training the TSBN model, $p_\theta(V, H)$, described in (2), with parameters $\theta$.
Given an observation $V$, we introduce a fixed-form distribution, $q_\phi(H \mid V)$, with parameters $\phi$, that
approximates the true posterior distribution, $p(H \mid V)$. We then follow the variational principle to
derive a lower bound on the marginal log-likelihood, expressed as¹

$\mathcal{L}(V, \theta, \phi) = \mathbb{E}_{q_\phi(H \mid V)} [\log p_\theta(V, H) - \log q_\phi(H \mid V)]$.   (7)

We construct the approximate posterior $q_\phi(H \mid V)$ as a recognition model. By using this, we avoid
the need to compute variational parameters per data point; instead we compute a set of parameters
$\phi$ used for all $V$. In order to achieve fast inference, the recognition model is expressed as

$q_\phi(H \mid V) = q(h_1 \mid v_1) \cdot \prod_{t=2}^{T} q(h_t \mid h_{t-1}, v_t, v_{t-1})$,   (8)

and each conditional distribution is specified as

$q(h_{jt} = 1 \mid h_{t-1}, v_t, v_{t-1}) = \sigma(u_{1j}^\top h_{t-1} + u_{2j}^\top v_t + u_{3j}^\top v_{t-1} + d_j)$,   (9)

where $h_0$ and $v_0$, for $q(h_1 \mid v_1)$, are defined as zero vectors. The recognition parameters $\phi$ are
specified as $U_1 \in \mathbb{R}^{J \times J}$, $U_2 \in \mathbb{R}^{J \times M}$, $U_3 \in \mathbb{R}^{J \times M}$. For $i = 1, 2, 3$, $u_{ij}$ is the transpose of the $j$th
row of $U_i$, and $d = [d_1, \ldots, d_J]^\top$ is the bias term. The graphical model is shown in Figure 1(b).

The recognition model defined in (9) has the same form as in the approximate inference used for the
TRBM [8]. Exact inference for our model consists of a forward and backward pass through the entire
sequence, which requires traversing each possible hidden state. Our feedforward approximation
allows the inference procedure to be fast and implemented in an online fashion.
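The feedforward character of the recognition model can be seen in a short sketch: a single left-to-right pass over the sequence both samples $H$ and accumulates $\log q_\phi(H \mid V)$. The function name `sample_recognition` and the toy parameters are illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_recognition(U1, U2, U3, d, V, rng):
    """Sample H ~ q(H | V) as in Eqs. (8)-(9); return (H, log q(H | V)).

    Shapes: U1 is J x J, U2 and U3 are J x M, d has length J.
    """
    J, M = len(d), len(V[0])
    h_prev, v_prev = [0] * J, [0] * M  # h_0 and v_0 are zero vectors
    H, log_q = [], 0.0
    for v in V:
        h = []
        for u1, u2, u3, dj in zip(U1, U2, U3, d):
            act = (sum(a * x for a, x in zip(u1, h_prev))
                   + sum(a * x for a, x in zip(u2, v))
                   + sum(a * x for a, x in zip(u3, v_prev)) + dj)
            p = sigmoid(act)
            hj = 1 if rng.random() < p else 0
            # Accumulate the Bernoulli log-probability of the sampled bit.
            log_q += math.log(p) if hj else math.log(1.0 - p)
            h.append(hj)
        H.append(h)
        h_prev, v_prev = h, v
    return H, log_q


rng = random.Random(0)
J, M = 2, 3
U1 = [[rng.gauss(0, 1) for _ in range(J)] for _ in range(J)]
U2 = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(J)]
U3 = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(J)]
V = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
H, log_q = sample_recognition(U1, U2, U3, [0.0] * J, V, rng)
```

Since each step only needs $h_{t-1}$, $v_t$ and $v_{t-1}$, the same loop can be run online as frames arrive.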

3.2 Parameter Learning 

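To optimize (7), we utilize Monte Carlo methods to approximate expectations and stochastic gradient
descent (SGD) for parameter optimization. The gradients can be expressed as

$\nabla_\theta \mathcal{L}(V) = \mathbb{E}_{q_\phi(H \mid V)}[\nabla_\theta \log p_\theta(V, H)]$,   (10)

$\nabla_\phi \mathcal{L}(V) = \mathbb{E}_{q_\phi(H \mid V)}[(\log p_\theta(V, H) - \log q_\phi(H \mid V)) \times \nabla_\phi \log q_\phi(H \mid V)]$.   (11)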

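¹This lower bound is equivalent to the marginal log-likelihood if $q_\phi(H \mid V) = p(H \mid V)$.

Specifically, in the TSBN model, if we define $\hat{v}_{mt} = \sigma(w_{2m}^\top h_t + w_{4m}^\top v_{t-1} + c_m)$ and $\hat{h}_{jt} =
\sigma(u_{1j}^\top h_{t-1} + u_{2j}^\top v_t + u_{3j}^\top v_{t-1} + d_j)$, the gradients for $w_{2m}$ and $u_{2j}$ can be calculated as

$\frac{\partial \log p_\theta(V, H)}{\partial w_{2mj}} = \sum_{t=1}^{T} (v_{mt} - \hat{v}_{mt}) \cdot h_{jt}, \quad \frac{\partial \log q_\phi(H \mid V)}{\partial u_{2jm}} = \sum_{t=1}^{T} (h_{jt} - \hat{h}_{jt}) \cdot v_{mt}$.   (12)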


Other update equations, along with the learning details for the TSBN variants in Section 2.3, are
provided in the Supplementary Section B. We observe that the gradients in (10) and (11) share many
similarities with the wake-sleep algorithm [20]. Wake-sleep alternates between updating $\theta$ in the
wake phase and updating $\phi$ in the sleep phase. The update of $\theta$ is based on the samples generated
from $q_\phi(H \mid V)$, and is identical to (10). However, in contrast to (11), the recognition parameters $\phi$
are estimated from samples generated by the model, i.e., $\nabla_\phi \mathcal{L}(V) = \mathbb{E}_{p_\theta(V, H)}[\nabla_\phi \log q_\phi(H \mid V)]$.
This update does not optimize the same objective as in (10), hence the wake-sleep algorithm is not
guaranteed to converge [13].


Inspecting (11), we see that we are using $l_\phi(V, H) = \log p_\theta(V, H) - \log q_\phi(H \mid V)$ as the learning
signal for the recognition parameters $\phi$. The expectation of this learning signal is exactly the lower
bound (7), which is easy to evaluate. However, this tractability makes the estimated gradients of the
recognition parameters very noisy. In order to make the algorithm practical, we employ the variance
reduction techniques proposed in [13], namely: (i) centering the learning signal, by subtracting the
data-independent baseline and the data-dependent baseline; (ii) variance normalization, by dividing
the centered learning signal by a running estimate of its standard deviation. The data-dependent
baseline is implemented using a neural network. Additionally, RMSprop [21], a form of SGD where
the gradients are adaptively rescaled by a running average of their recent magnitude, was found
in practice to be important for fast convergence, and is thus utilized throughout all the experiments. The
outline of the NVIL algorithm is provided in the Supplementary Section A.
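The two variance-reduction steps can be sketched as below, following the structure of the algorithm outline in Supplementary Section A. The function name `center_and_normalize` is hypothetical, the data-dependent baseline is passed in as precomputed numbers rather than a neural network, and the use of the population variance for the batch statistic is an assumption.

```python
import math


def center_and_normalize(signals, baselines, c, v, alpha=0.8):
    """Apply centering and variance normalization to per-step learning signals.

    signals[t] is l_t, baselines[t] is the data-dependent baseline for step t,
    and (c, v) are running estimates of the signal mean and variance.
    """
    # Step (i), part one: subtract the data-dependent baseline.
    centered = [l - base for l, base in zip(signals, baselines)]
    mean_b = sum(centered) / len(centered)
    var_b = sum((l - mean_b) ** 2 for l in centered) / len(centered)
    # Exponential moving averages of the batch statistics.
    c = alpha * c + (1.0 - alpha) * mean_b
    v = alpha * v + (1.0 - alpha) * var_b
    # Step (i), part two, and step (ii): subtract the running mean and divide
    # by the running standard deviation, never amplifying the signal.
    scale = max(1.0, math.sqrt(v))
    return [(l - c) / scale for l in centered], c, v


signals, c, v = center_and_normalize([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], c=0.0, v=1.0)
```

The returned signals multiply $\nabla_\phi \log q_\phi$ in the recognition update; the running statistics `c` and `v` are carried over to the next call.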


3.3 Extension to deep models 

The recognition model corresponding to the deep TSBN is shown in Figure 1(d). Two kinds of deep
architectures are discussed in Section 2.4. We illustrate the difference of their learning algorithms
in two respects: (i) the calculation of the lower bound; and (ii) the calculation of the gradients.

The top hidden layer is stochastic. If the middle hidden layers are also stochastic, the calculation
of the lower bound is more involved, compared with the shallow model; however, the gradient
evaluation remains simple as in (12). On the other hand, if deterministic middle hidden layers (i.e.,
recurrent neural networks) are employed, the lower bound objective will stay the same as a shallow
model, since the only stochasticity in the generative process lies in the top layer; however, the
gradients have to be calculated recursively through the back-propagation through time algorithm
[22]. All details are provided in the Supplementary Section C.


4 Related Work 

The RBM has been widely used as a building block to learn the sequential dependencies in time-series
data, e.g., the conditional-RBM-related models [7, 23], and the temporal RBM [8]. To make exact
inference possible, the recurrent temporal RBM was also proposed [9], and further extended to learn
the dependency structure within observations [10, 11].

In the work reported here, we focus on modeling sequences based on the SBN [16], which recently
has been shown to have the potential to build deep generative models nails] |24l. Our work serves 
as another extension of the SBN that can be utilized to model time-series data. Similar ideas have 
also been considered in [25] and [26]. However, in [25], the authors focus on grammar learning, and
use a feed-forward approximation of the mean-field VB to carry out the inference; while in [26], the
wake-sleep algorithm was developed. We apply the model in a different scenario, and develop a fast 
and scalable inference algorithm, based on the idea of training a recognition model by leveraging 
the stochastic gradient of the variational bound. 

There exist two main methods for the training of recognition models. The first one, termed Stochastic
Gradient Variational Bayes (SGVB), is based on a reparameterization trick [12, 14], which can
only be employed in models with continuous latent variables, e.g., the variational auto-encoder [12]
and all the recent recurrent extensions of it [27, 28, 29]. The second one, called Neural Variational
Inference and Learning (NVIL), is based on the log-derivative trick [13], which is more general and
is also applicable to models with discrete random variables. The NVIL algorithm has been
previously applied to the training of SBNs in [13]. Our approach serves as a new application of this
algorithm for an SBN-based time-series model.

Figure 2: (Left) Dictionaries learned using the HMSBN for the videos of bouncing balls. (Middle)
Samples generated from the HMSBN trained on the polyphonic music. Each column is a sample
vector of notes. (Right) Time evolving from 1790 to 2014 for three selected topics learned from the
STU dataset. Plotted values represent normalized probabilities that the topic appears in a given year.
Best viewed electronically.

5 Experiments 

We present experimental results on four publicly available datasets: the bouncing balls [9], polyphonic
music [10], motion capture [7] and State of the Union [30]. To assess the performance of the
TSBN model, we show sequences generated from the model, and report the average log-probability
that the model assigns to a test sequence, and the average squared one-step-ahead prediction error per
frame. Code is available at https://github.com/zhegan27/TSBN_code_NIPS2015.

The TSBN model with $W_3 = 0$ and $W_4 = 0$ is denoted Hidden Markov SBN (HMSBN), the deep
TSBN with stochastic hidden layer is denoted DTSBN-S, and the deep TSBN with deterministic 
hidden layer is denoted DTSBN-D. 

Model parameters were initialized by sampling randomly from $\mathcal{N}(0, 0.001^2 I)$, except for the bias
parameters, which were initialized as 0. The TSBN model is trained using a variant of RMSprop
©, with momentum of 0.9, and a constant learning rate of 10 The decay over the root mean 
squared gradients is set to 0.95. The maximum number of iterations we use is 10®. The gradient 
estimates were computed using a single sample from the recognition model. The only regularization 
we used was a weight decay of 10“"^. The data-dependent baseline was implemented by using a 
neural network with a single hidden layer with 100 tanh units. 

For the prediction of $v_t$ given $v_{1:t-1}$, we (i) first obtain a sample from $q_\phi(h_{1:t-1} \mid v_{1:t-1})$; (ii)
calculate the conditional posterior $p_\theta(h_t \mid h_{1:t-1}, v_{1:t-1})$ of the current hidden state; (iii) make a
prediction for $v_t$ using $p_\theta(v_t \mid h_{1:t}, v_{1:t-1})$. On the other hand, synthesizing samples is conceptually
simpler. Sequences can be readily generated from the model using ancestral sampling.

5.1 Bouncing balls dataset 

We conducted the first experiment on synthetic videos of 3 bouncing balls, where pixels are binary
valued. We followed the procedure in [9], and generated 4000 videos for training, and another 200
videos for testing. Each video is of length 100 and of resolution 30 × 30.

The dictionaries learned using the HMSBN are shown in Figure 2 (Left). Compared with previous
work [9, 10], our learned bases are more spatially localized. In Table 1, we compare the average
squared prediction error per frame over the 200 test videos, with recurrent temporal RBM (RTRBM)
and structured RTRBM (SRTRBM). As can be seen, our approach achieves better performance
compared with the baselines in the literature. Furthermore, we observe that a high-order TSBN
reduces the prediction error significantly, compared with an order-one TSBN. This is due to the fact
that by using a high-order TSBN, more information about the past is conveyed. We also examine
the advantage of employing deep models. Using stochastic or deterministic hidden layers improves
performance. More results, including log-likelihoods, are provided in Supplementary Section D.

Table 1: Average prediction error for the bouncing balls dataset. (°) taken from [11].

Model      Dim       Order   Pred. Err.
DTSBN-S    100-100   2       2.79 ± 0.39
DTSBN-D    100-100   2       2.99 ± 0.42
TSBN       100       4       3.07 ± 0.40
TSBN       100       1       9.48 ± 0.38
RTRBM°     3750      1       3.88 ± 0.33
SRTRBM°    3750      1       3.31 ± 0.33

Table 2: Average prediction error obtained for the motion capture dataset. (°) taken from [11].

Model        Walking         Running
DTSBN-S      4.40 ± 0.28     2.56 ± 0.40
DTSBN-D      4.62 ± 0.01     2.84 ± 0.01
TSBN         5.12 ± 0.50     4.85 ± 1.26
HMSBN        10.77 ± 1.15    7.39 ± 0.47
SS-SRTRBM°   8.13 ± 0.06     5.88 ± 0.05
G-RTRBM°     14.41 ± 0.38    10.91 ± 0.27

5.2 Motion capture dataset 

In this experiment, we used the CMU motion capture dataset, which consists of measured joint angles
for different motion types. We used the 33 running and walking sequences of subject 35 (23 walking
sequences and 10 running sequences). We followed the preprocessing procedure of [11], after which
we were left with 58 joint angles. We partitioned the 33 sequences into training and testing sets: the
first had 31 sequences, and the second had 2 sequences (one walking and one running).
We averaged the prediction error over 100 trials, as reported in Table 2. The TSBN we implemented
is of size 100 in each hidden layer and order 1. It can be seen that the TSBN-based models improve
over the Gaussian (G-)RTRBM and the spike-slab (SS-)SRTRBM significantly.



Figure 3: Motion trajectories generated from the HMSBN trained on the motion capture dataset. 
(Left) Walking. (Middle) Running-running-walking. (Right) Running-walking. 

Another popular motion capture dataset is the MIT dataset². To further demonstrate the directed,
generative nature of our model, we give our trained HMSBN model different initializations, and
show generated, synthetic data and the transitions between different motion styles in Figure 3. These
generated data are readily produced from the model and demonstrate realistic behavior. The smooth
trajectories are walking movements, while the vibrating ones are running. Corresponding video files
(AVI) are provided as mocap 1, 2 and 3 in the Supplementary Material.

5.3 Polyphonic music dataset 

The third experiment is based on four different polyphonic music sequences of piano [10], i.e.,
Piano-midi.de (Piano), Nottingham (Nott), MuseData (Muse) and JSB chorales (JSB). Each of these
datasets is represented as a collection of 88-dimensional binary sequences, that span the whole
range of piano from A0 to C8.

The samples generated from the trained HMSBN model are shown in Figure 2 (Middle). As can
be seen, different styles of polyphonic music are synthesized. The corresponding MIDI files are
provided as music 1 and 2 in the Supplementary Material. Our model has the ability to learn basic
harmony rules and local temporal coherence. However, long-term structure and musical melody
remain elusive. The variational lower bound, along with the estimated log-likelihood in [10], are
presented in Table 3. The TSBN we implemented is of size 100 and order 1. Empirically, adding
layers did not improve performance on this dataset, hence no such results are reported. The results
of RNN-NADE and RTRBM [10] were obtained by only 100 runs of the annealed importance sampling,
which has the potential to overestimate the true log-likelihood. Our variational lower bound
provides a more conservative estimate. Even so, our performance is still better than that of RNN.

²Quantitative results on the MIT dataset are provided in Supplementary Section D.

Table 3: Test log-likelihood for the polyphonic music dataset. (°) taken from [10].

Model        Piano.   Nott.   Muse.   JSB.
TSBN         -7.98    -3.67   -6.81   -7.48
RNN-NADE°    -7.05    -2.31   -5.60   -5.56
RTRBM°       -7.36    -2.62   -6.35   -6.35
RNN°         -8.37    -4.46   -8.13   -8.71

Table 4: Average prediction precision for STU. (°) taken from [31].

Model       Dim     MP               PP
HMSBN       25      0.327 ± 0.002    0.353 ± 0.070
DHMSBN-S    25-25   0.299 ± 0.001    0.378 ± 0.006
GP-DPFA°    100     0.223 ± 0.001    0.189 ± 0.003
DRFM°       25      0.217 ± 0.003    0.177 ± 0.010

5.4 State of the Union dataset 

The State of the Union (STU) dataset contains the transcripts of $T = 225$ US State of the Union addresses,
from 1790 to 2014. Two tasks are considered, i.e., prediction and dynamic topic modeling.

Prediction  The prediction task is concerned with estimating the held-out words. We employ the
setup in [31]. After removing stop words and terms that occur fewer than 7 times in one document or
fewer than 20 times overall, there are 2375 unique words. The entire data of the last year is held out.
For the documents in the previous years, we randomly partition the words of each document into
an 80%/20% split. The model is trained on the 80% portion, and the remaining 20% held-out words
are used to test the prediction at each year. The words in both held-out sets are ranked according to
the probability estimated from (6).

To evaluate the prediction performance, we calculate the precision @top-$M$ as in [31], which is given
by the fraction of the top-$M$ words, predicted by the model, that match the true ranking of the word
counts. $M = 50$ is used. Two recent works are compared, GP-DPFA [31] and DRFM [30]. The
results are summarized in Table 4. Our model is of order 1. The column MP denotes the mean
precision over all the years that appear in the training set. The column PP denotes the predictive
precision for the final year. Our model achieves significant improvements in both scenarios.
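One plausible reading of the precision @top-$M$ metric is the overlap between the $M$ words the model ranks highest and the $M$ words with the highest held-out counts. The sketch below implements that reading; the function name `precision_at_top_m` and the tie-breaking by original index are assumptions, and the exact protocol of the cited setup may differ.

```python
def precision_at_top_m(predicted_probs, true_counts, m):
    """Fraction of the model's top-m words that are also among the true top-m words."""
    def top(scores):
        # Python's sort is stable, so ties keep their original index order.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:m])
    return len(top(predicted_probs) & top(true_counts)) / m


# Toy vocabulary of 4 words; the numbers are illustrative only.
score = precision_at_top_m([0.5, 0.1, 0.3, 0.1], [10, 0, 1, 7], m=2)
```

In this toy case the model's top two words are words 0 and 2, the true top two are words 0 and 3, so half of the predicted words are correct.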

Dynamic Topic Modeling The setup described in IMi is employed, and the number of topics is 
200. To understand the temporal dynamic per topic, three topics are selected and the normalized
probability that a topic appears in each year is shown in Figure 2 (Right). Their associated top 6
words per topic are shown in Table 5. The learned trajectory exhibits different temporal patterns
across the topics. Clearly, we can identify jumps associated with some key historical events. For
instance, for Topic 29, we observe a positive jump in 1986 related to military and paramilitary
activities in and against Nicaragua brought by the U.S. Topic 30 is related to war, where the War
of 1812, World War II and Iraq War all spike up in their corresponding years. In Topic 130, we
observe consistent positive jumps from 1890 to 1920, when the American revolution was taking
place. Three other interesting topics are also shown in Table 5. Topic 64 appears to be related to
education. Topic 70 is about Iraq, and Topic 74 is Axis and World War II. We note that the words
for these topics are explicitly related to these matters.


Table 5: Top 6 most probable words associated with the STU topics.

Topic #29    Topic #30    Topic #130    Topic #64     Topic #70   Topic #74
family       officer      government    generations   Iraqi       Philippines
budget       civilized    country       generation    Qaida       islands
Nicaragua    warfare      public        recognize     Iraq        axis
free         enemy        law           brave         Iraqis      Nazis
future       whilst       present       crime         Al          Japanese
freedom      gained       citizens      race          Saddam      Germans

6 Conclusion 

We have presented the Deep Temporal Sigmoid Belief Network, an extension of the SBN that models
the temporal dependencies in high-dimensional sequences. To allow for scalable inference and
learning, an efficient variational optimization algorithm is developed. Experimental results on several
datasets show that the proposed approach obtains superior predictive performance, and synthesizes
interesting sequences.

In this work, we have investigated the modeling of different types of data individually. One interesting
direction for future work is to combine them into a unified framework for dynamic multi-modality learning.
Furthermore, we can use high-order optimization methods to speed up inference [32].

A Outline of the NVIL algorithm


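The outline of the NVIL algorithm for computing gradients is shown below (reproduced from [13]). $C_\lambda(v_t)$
represents the data-dependent baseline, and $\alpha = 0.8$ throughout the experiments.

Algorithm 1 Compute gradient estimates for the model parameters and recognition parameters.

    $\Delta\theta \leftarrow 0$, $\Delta\phi \leftarrow 0$, $\Delta\lambda \leftarrow 0$, $\mathcal{L} \leftarrow 0$
    for $t \leftarrow 1$ to $T$ do
        $h_t \sim q_\phi(h_t \mid v_t)$
        $l_t \leftarrow \log p_\theta(v_t, h_t) - \log q_\phi(h_t \mid v_t)$
        $\mathcal{L} \leftarrow \mathcal{L} + l_t$
        $l_t \leftarrow l_t - C_\lambda(v_t)$
    end for
    $c_b \leftarrow \mathrm{mean}(l_1, \ldots, l_T)$
    $v_b \leftarrow \mathrm{variance}(l_1, \ldots, l_T)$
    $c \leftarrow \alpha c + (1 - \alpha) c_b$
    $v \leftarrow \alpha v + (1 - \alpha) v_b$
    for $t \leftarrow 1$ to $T$ do
        $l_t \leftarrow (l_t - c) / \max(1, \sqrt{v})$
        $\Delta\theta \leftarrow \Delta\theta + \nabla_\theta \log p_\theta(v_t, h_t)$
        $\Delta\phi \leftarrow \Delta\phi + l_t \nabla_\phi \log q_\phi(h_t \mid v_t)$
        $\Delta\lambda \leftarrow \Delta\lambda + l_t \nabla_\lambda C_\lambda(v_t)$
    end for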

B Learning and Inference Details on TSBN 


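For $t = 1, \ldots, T$, consider $v_t \in \{0,1\}^M$ and $h_t \in \{0,1\}^J$. The model parameters $\theta$ are specified as
$W_1 \in \mathbb{R}^{J \times J}$, $W_2 \in \mathbb{R}^{M \times J}$, $W_3 \in \mathbb{R}^{J \times M}$, $W_4 \in \mathbb{R}^{M \times M}$, $b \in \mathbb{R}^J$, and $c \in \mathbb{R}^M$. The generative model is expressed as

$p(h_{jt} = 1 \mid h_{t-1}, v_{t-1}) = \sigma(w_{1j}^\top h_{t-1} + w_{3j}^\top v_{t-1} + b_j)$,   (13)

$p(v_{mt} = 1 \mid h_t, v_{t-1}) = \sigma(w_{2m}^\top h_t + w_{4m}^\top v_{t-1} + c_m)$.   (14)

The recognition model is expressed as

$q(h_{jt} = 1 \mid h_{t-1}, v_t, v_{t-1}) = \sigma(u_{1j}^\top h_{t-1} + u_{2j}^\top v_t + u_{3j}^\top v_{t-1} + d_j)$,   (15)

where the recognition parameters are specified as $U_1 \in \mathbb{R}^{J \times J}$, $U_2 \in \mathbb{R}^{J \times M}$, $U_3 \in \mathbb{R}^{J \times M}$, and $d \in \mathbb{R}^J$. $h_0$
and $v_0$, needed for $p(v_1 \mid h_1)$ and $q(h_1 \mid v_1)$, are defined as zero vectors, for conciseness.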


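In order to implement the NVIL algorithm described in [13], we need to calculate the lower bound and also the
gradients. Specifically, we have the variational lower bound $\mathcal{L} = \mathbb{E}_{q_\phi(H \mid V)}[\sum_{t=1}^{T} l_t]$, where $l_t$ is expressed as

$l_t = \sum_{j=1}^{J} \big(\psi^{(1)}_{jt} h_{jt} - \log(1 + \exp(\psi^{(1)}_{jt}))\big) + \sum_{m=1}^{M} \big(\psi^{(2)}_{mt} v_{mt} - \log(1 + \exp(\psi^{(2)}_{mt}))\big)
      - \sum_{j=1}^{J} \big(\psi^{(3)}_{jt} h_{jt} - \log(1 + \exp(\psi^{(3)}_{jt}))\big)$,   (16)

and we have defined

$\psi^{(1)}_{jt} = w_{1j}^\top h_{t-1} + w_{3j}^\top v_{t-1} + b_j$,   (17)

$\psi^{(2)}_{mt} = w_{2m}^\top h_t + w_{4m}^\top v_{t-1} + c_m$,   (18)

$\psi^{(3)}_{jt} = u_{1j}^\top h_{t-1} + u_{2j}^\top v_t + u_{3j}^\top v_{t-1} + d_j$.   (19)

By further defining

$\chi^{(1)}_{jt} = h_{jt} - \sigma(\psi^{(1)}_{jt}), \quad \chi^{(2)}_{mt} = v_{mt} - \sigma(\psi^{(2)}_{mt}), \quad \chi^{(3)}_{jt} = h_{jt} - \sigma(\psi^{(3)}_{jt})$,   (20)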
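the gradients for the model parameters $\theta$ are expressed as

$\frac{\partial \log p_\theta(v_t, h_t)}{\partial w_{1jj'}} = \chi^{(1)}_{jt} h_{j',t-1}, \quad \frac{\partial \log p_\theta(v_t, h_t)}{\partial w_{3jm}} = \chi^{(1)}_{jt} v_{m,t-1}, \quad \frac{\partial \log p_\theta(v_t, h_t)}{\partial b_j} = \chi^{(1)}_{jt}$,   (21)

$\frac{\partial \log p_\theta(v_t, h_t)}{\partial w_{2mj}} = \chi^{(2)}_{mt} h_{jt}, \quad \frac{\partial \log p_\theta(v_t, h_t)}{\partial w_{4mm'}} = \chi^{(2)}_{mt} v_{m',t-1}, \quad \frac{\partial \log p_\theta(v_t, h_t)}{\partial c_m} = \chi^{(2)}_{mt}$.   (22)

The gradients for the recognition parameters $\phi$ are expressed as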


d\ogq^{ht\vt) _ ^X3)u 

du. ' 




d log q 4 ,{ht\vt) _ ^ (3)„ 

O — Xjt ^rnt — 1 , 


d log qx,{ht\vt) _ (3) 

Cl Xjt 

^U2jm 

9log q<i,(ht\vt) ^ (3) 

ddi ■ ■ 


(23) 

(24) 
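The closed-form gradients above can be sanity-checked numerically. The sketch below compares the expressions for $W_1$ and $W_2$ against central finite differences of $\log p_\theta(v_t, h_t)$; the helper names, the random parameters, and the fixed binary vectors are all hypothetical choices made for the check.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_p(W1, W2, W3, W4, b, c, h, v, h_prev, v_prev):
    """log p(v_t, h_t | h_{t-1}, v_{t-1}) for binary h_t and v_t."""
    psi1 = W1 @ h_prev + W3 @ v_prev + b      # Eq. (17)
    psi2 = W2 @ h + W4 @ v_prev + c           # Eq. (18)
    # np.logaddexp(0, x) is a numerically stable log(1 + exp(x)).
    return (np.sum(psi1 * h - np.logaddexp(0.0, psi1))
            + np.sum(psi2 * v - np.logaddexp(0.0, psi2)))

def numeric_grad(f, W, eps=1e-6):
    """Central finite differences of a scalar function f with respect to W."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        G[idx] = (f(Wp) - f(Wm)) / (2.0 * eps)
    return G

rng = np.random.default_rng(1)
J, M = 3, 4
W1, W2 = rng.normal(size=(J, J)), rng.normal(size=(M, J))
W3, W4 = rng.normal(size=(J, M)), rng.normal(size=(M, M))
b, c = rng.normal(size=J), rng.normal(size=M)
h_prev, h = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
v_prev, v = np.array([1.0, 1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0, 1.0])

# Analytic gradients built from the chi quantities of Eq. (20).
chi1 = h - sigmoid(W1 @ h_prev + W3 @ v_prev + b)
chi2 = v - sigmoid(W2 @ h + W4 @ v_prev + c)
grad_W1 = np.outer(chi1, h_prev)   # entry (j, j') is chi1_j * h_{j',t-1}
grad_W2 = np.outer(chi2, h)        # entry (m, j)  is chi2_m * h_{jt}

num_W1 = numeric_grad(lambda W: log_p(W, W2, W3, W4, b, c, h, v, h_prev, v_prev), W1)
num_W2 = numeric_grad(lambda W: log_p(W1, W, W3, W4, b, c, h, v, h_prev, v_prev), W2)
err_W1 = np.abs(num_W1 - grad_W1).max()
err_W2 = np.abs(num_W2 - grad_W2).max()
```

The same outer-product pattern applies to the remaining weight matrices and to the recognition parameters, with $\chi^{(3)}_{jt}$ in place of $\chi^{(1)}_{jt}$.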


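B.1 Modeling Real-valued Data

When modeling real-valued data, we substitute (14) with $p(v_t | h_t, v_{t-1}) = \mathcal{N}(\mu_t, \mathrm{diag}(\sigma_t^2))$, where

$$\mu_{mt} = w_{2m}^\top h_t + w_{4m}^\top v_{t-1} + c_m, \qquad \log \sigma^2_{mt} = (w'_{2m})^\top h_t + (w'_{4m})^\top v_{t-1} + c'_m, \tag{25}$$

and we have $W'_2 \in \mathbb{R}^{M \times J}$ and $W'_4 \in \mathbb{R}^{M \times M}$. The recognition model remains the same as in (15). Letting
$\tau_{mt} = \log \sigma^2_{mt}$, we obtain

$$l_t = \sum_{j=1}^{J} \left( \psi^{(1)}_{jt} h_{jt} - \log(1 + \exp(\psi^{(1)}_{jt})) \right) - \sum_{m=1}^{M} \left( \frac{1}{2} \log 2\pi + \frac{\tau_{mt}}{2} + \frac{(v_{mt} - \mu_{mt})^2}{2 e^{\tau_{mt}}} \right) - \sum_{j=1}^{J} \left( \psi^{(3)}_{jt} h_{jt} - \log(1 + \exp(\psi^{(3)}_{jt})) \right). \tag{26}$$

All the gradient calculations remain the same as in (21)-(24), except the following:

$$\frac{\partial \log p_\theta(v_t, h_t)}{\partial W_{2mj}} = \chi^{(4)}_{mt} h_{jt}, \qquad \frac{\partial \log p_\theta(v_t, h_t)}{\partial W_{4mm'}} = \chi^{(4)}_{mt} v_{m',t-1}, \qquad \frac{\partial \log p_\theta(v_t, h_t)}{\partial c_m} = \chi^{(4)}_{mt}, \tag{27}$$

$$\frac{\partial \log p_\theta(v_t, h_t)}{\partial W'_{2mj}} = \chi^{(5)}_{mt} h_{jt}, \qquad \frac{\partial \log p_\theta(v_t, h_t)}{\partial W'_{4mm'}} = \chi^{(5)}_{mt} v_{m',t-1}, \qquad \frac{\partial \log p_\theta(v_t, h_t)}{\partial c'_m} = \chi^{(5)}_{mt}, \tag{28}$$

where we have defined

$$\chi^{(4)}_{mt} = \frac{\partial \log p_\theta(v_t, h_t)}{\partial \mu_{mt}} = \frac{v_{mt} - \mu_{mt}}{e^{\tau_{mt}}}, \qquad \chi^{(5)}_{mt} = \frac{\partial \log p_\theta(v_t, h_t)}{\partial \tau_{mt}} = \frac{(v_{mt} - \mu_{mt})^2}{2 e^{\tau_{mt}}} - \frac{1}{2}. \tag{29}$$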
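The two derivatives of the Gaussian log-density can be verified with a short numerical check. The sketch assumes the parameterization $\tau_{mt} = \log \sigma^2_{mt}$ (log-variance); the function name `gauss_log_density` and the numbers are hypothetical.

```python
import numpy as np

def gauss_log_density(v, mu, tau):
    """Diagonal Gaussian log-density with tau = log(sigma^2) (assumed)."""
    return np.sum(-0.5 * np.log(2.0 * np.pi) - 0.5 * tau
                  - (v - mu) ** 2 / (2.0 * np.exp(tau)))

v = np.array([0.3, -1.2])      # hypothetical observations v_{mt}
mu = np.array([0.1, -0.5])     # hypothetical means mu_{mt}
tau = np.array([0.2, -0.4])    # hypothetical log-variances tau_{mt}

# Analytic derivatives with respect to mu and tau.
chi4 = (v - mu) / np.exp(tau)
chi5 = (v - mu) ** 2 / (2.0 * np.exp(tau)) - 0.5

# Central finite differences along each coordinate direction.
eps = 1e-6
num4 = np.array([(gauss_log_density(v, mu + eps * e, tau)
                  - gauss_log_density(v, mu - eps * e, tau)) / (2.0 * eps)
                 for e in np.eye(len(v))])
num5 = np.array([(gauss_log_density(v, mu, tau + eps * e)
                  - gauss_log_density(v, mu, tau - eps * e)) / (2.0 * eps)
                 for e in np.eye(len(v))])
err4 = np.abs(num4 - chi4).max()
err5 = np.abs(num5 - chi5).max()
```

If $\tau_{mt}$ were instead taken to be the log standard deviation, the constants in both derivatives would change, so the check is tied to the stated parameterization.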

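B.2 Modeling Count Data

We also introduce an approach for modeling time-series data with count observations, by replacing (14) with
$p(v_t | h_t, v_{t-1}) = \prod_{m=1}^{M} y_{mt}^{v_{mt}}$, where

$$y_{mt} = \frac{\exp(w_{2m}^\top h_t + w_{4m}^\top v_{t-1} + c_m)}{\sum_{m'=1}^{M} \exp(w_{2m'}^\top h_t + w_{4m'}^\top v_{t-1} + c_{m'})}. \tag{30}$$

The recognition model still remains the same as in (15). The $l_t$ now is expressed as

$$l_t = \sum_{j=1}^{J} \left( \psi^{(1)}_{jt} h_{jt} - \log(1 + \exp(\psi^{(1)}_{jt})) \right) + \sum_{m=1}^{M} \left( \psi^{(2)}_{mt} v_{mt} - v_{mt} \log \sum_{m'=1}^{M} \exp(\psi^{(2)}_{m't}) \right) - \sum_{j=1}^{J} \left( \psi^{(3)}_{jt} h_{jt} - \log(1 + \exp(\psi^{(3)}_{jt})) \right). \tag{31}$$

All the gradient calculations remain the same as in (21)-(24), except the following:

$$\frac{\partial \log p_\theta(v_t, h_t)}{\partial W_{2mj}} = \chi^{(6)}_{mt} h_{jt}, \qquad \frac{\partial \log p_\theta(v_t, h_t)}{\partial W_{4mm'}} = \chi^{(6)}_{mt} v_{m',t-1}, \qquad \frac{\partial \log p_\theta(v_t, h_t)}{\partial c_m} = \chi^{(6)}_{mt}, \tag{32}$$

where we have defined $\chi^{(6)}_{mt} = v_{mt} - y_{mt} \sum_{m'=1}^{M} v_{m't}$.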
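The expression for $\chi^{(6)}_{mt}$ is the derivative of the count log-likelihood $\sum_m v_{mt} \log y_{mt}$ with respect to the pre-softmax activation. A small sketch, with hypothetical activations and counts, confirms this against finite differences:

```python
import numpy as np

def count_log_lik(psi, v):
    """sum_m v_m * log y_m with y = softmax(psi), computed stably."""
    shifted = psi - psi.max()
    log_y = shifted - np.log(np.sum(np.exp(shifted)))
    return np.sum(v * log_y)

psi = np.array([0.5, -1.0, 2.0])   # hypothetical activations psi^(2)_{mt}
v = np.array([3.0, 0.0, 1.0])      # hypothetical counts v_{mt}

y = np.exp(psi - psi.max())
y /= y.sum()                       # softmax probabilities y_{mt}
chi6 = v - y * v.sum()             # analytic derivative with respect to psi

eps = 1e-6
num = np.array([(count_log_lik(psi + eps * e, v)
                 - count_log_lik(psi - eps * e, v)) / (2.0 * eps)
                for e in np.eye(len(psi))])
chi6_err = np.abs(num - chi6).max()
```

Because the probabilities $y_{mt}$ sum to one, the entries of `chi6` sum to zero, which is a quick additional consistency check.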


C Learning and Inference Details on Deep TSBN 

For the ease of notation, we consider a two-hidden-layer deep TSBN here, which can be readily extended to a
deep model with any depth. For $t = 1, \dots, T$, we consider the observation as $v_t \in \{0,1\}^M$. The top hidden
layer is denoted as $z_t \in \{0,1\}^J$.

Figure 4: Generative and recognition model of a two-layer Deep TSBN.


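C.1 Using stochastic hidden layer


Denote the first stochastic hidden layer as $h_t \in \{0,1\}^K$. The generative model is expressed as

$$p(z_{jt} = 1) = \sigma(w_{1j}^\top z_{t-1} + w_{2j}^\top h_{t-1} + b_{1j}), \tag{33}$$

$$p(h_{kt} = 1) = \sigma(w_{3k}^\top z_t + w_{4k}^\top h_{t-1} + w_{6k}^\top v_{t-1} + b_{2k}), \tag{34}$$

$$p(v_{mt} = 1) = \sigma(w_{5m}^\top h_t + w_{7m}^\top v_{t-1} + b_{3m}), \tag{35}$$

where we have defined $W_1 \in \mathbb{R}^{J \times J}$, $W_2 \in \mathbb{R}^{J \times K}$, $W_3 \in \mathbb{R}^{K \times J}$, $W_4 \in \mathbb{R}^{K \times K}$, $W_5 \in \mathbb{R}^{M \times K}$,
$W_6 \in \mathbb{R}^{K \times M}$, and $W_7 \in \mathbb{R}^{M \times M}$. The bias terms are $b_1 \in \mathbb{R}^{J}$, $b_2 \in \mathbb{R}^{K}$ and $b_3 \in \mathbb{R}^{M}$. The
corresponding recognition model is expressed as

$$q(h_{kt} = 1) = \sigma(u_{4k}^\top v_t + u_{5k}^\top h_{t-1} + u_{6k}^\top v_{t-1} + c_{2k}), \tag{36}$$

$$q(z_{jt} = 1) = \sigma(u_{1j}^\top h_t + u_{2j}^\top z_{t-1} + u_{3j}^\top h_{t-1} + c_{1j}), \tag{37}$$

where the recognition parameters are specified as $U_1 \in \mathbb{R}^{J \times K}$, $U_2 \in \mathbb{R}^{J \times J}$, $U_3 \in \mathbb{R}^{J \times K}$, $U_4 \in \mathbb{R}^{K \times M}$,
$U_5 \in \mathbb{R}^{K \times K}$ and $U_6 \in \mathbb{R}^{K \times M}$. The bias terms are $c_1 \in \mathbb{R}^{J}$ and $c_2 \in \mathbb{R}^{K}$. Now, $l_t$ is expressed as

$$l_t = \sum_{j=1}^{J} \left( \psi^{(1)}_{jt} z_{jt} - \log(1 + \exp(\psi^{(1)}_{jt})) \right) + \sum_{k=1}^{K} \left( \psi^{(2)}_{kt} h_{kt} - \log(1 + \exp(\psi^{(2)}_{kt})) \right) + \sum_{m=1}^{M} \left( \psi^{(3)}_{mt} v_{mt} - \log(1 + \exp(\psi^{(3)}_{mt})) \right) - \sum_{k=1}^{K} \left( \psi^{(4)}_{kt} h_{kt} - \log(1 + \exp(\psi^{(4)}_{kt})) \right) - \sum_{j=1}^{J} \left( \psi^{(5)}_{jt} z_{jt} - \log(1 + \exp(\psi^{(5)}_{jt})) \right), \tag{38}$$

and we have defined

$$\psi^{(1)}_{jt} = w_{1j}^\top z_{t-1} + w_{2j}^\top h_{t-1} + b_{1j}, \tag{39}$$

$$\psi^{(2)}_{kt} = w_{3k}^\top z_t + w_{4k}^\top h_{t-1} + w_{6k}^\top v_{t-1} + b_{2k}, \tag{40}$$

$$\psi^{(3)}_{mt} = w_{5m}^\top h_t + w_{7m}^\top v_{t-1} + b_{3m}, \tag{41}$$

$$\psi^{(4)}_{kt} = u_{4k}^\top v_t + u_{5k}^\top h_{t-1} + u_{6k}^\top v_{t-1} + c_{2k}, \tag{42}$$

$$\psi^{(5)}_{jt} = u_{1j}^\top h_t + u_{2j}^\top z_{t-1} + u_{3j}^\top h_{t-1} + c_{1j}. \tag{43}$$

All the gradients can be calculated readily as in (21)-(24).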


C.2 Using deterministic hidden layer

For the generative model, denote the deterministic hidden layer as $h^{g}_t \in \mathbb{R}^K$. For the recognition model, denote
the deterministic hidden layer as $h^{r}_t \in \mathbb{R}^K$. $W_2$ and $U_3$ are set to be zero matrices for the ease of gradient
calculation. The generative model is expressed as

$$p(z_{jt} = 1) = \sigma(w_{1j}^\top z_{t-1} + b_{1j}), \tag{44}$$

$$h^{g}_{kt} = f(w_{3k}^\top z_t + w_{4k}^\top h^{g}_{t-1} + w_{6k}^\top v_{t-1} + b_{2k}), \tag{45}$$

$$p(v_{mt} = 1) = \sigma(w_{5m}^\top h^{g}_t + w_{7m}^\top v_{t-1} + b_{3m}). \tag{46}$$


Table 6: Average prediction error and the average negative log-likelihood per
frame for the bouncing balls dataset. (◇) taken from [11].

Model       Dim       Order   Pred. Err.       Neg. Log. Like.
DTSBN-S     100-100   2        2.79 ± 0.39     69.29 ± 1.52
DTSBN-D     100-100   2        2.99 ± 0.42     70.47 ± 1.52
DTSBN-S     100-100   1       10.39 ± 0.38     78.63 ± 0.92
TSBN        100       4        3.07 ± 0.40     70.41 ± 1.55
TSBN        100       2        4.00 ± 0.45     73.32 ± 1.75
TSBN        100       1        9.48 ± 0.38     77.71 ± 0.83
HMSBN       100       1       23.94 ± 0.41     86.27 ± 0.80
AR          0         2        3.63 ± 0.42     73.80 ± 1.46
AR          0         1       11.01 ± 0.24     93.61 ± 0.67
RTRBM◇      3750      1        3.88 ± 0.33     -
SRTRBM◇     3750      1        3.31 ± 0.33     -


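The corresponding recognition model is expressed as

$$h^{r}_{kt} = f(u_{4k}^\top v_t + u_{5k}^\top h^{r}_{t-1} + u_{6k}^\top v_{t-1} + c_{2k}), \tag{47}$$

$$q(z_{jt} = 1) = \sigma(u_{1j}^\top h^{r}_t + u_{2j}^\top z_{t-1} + c_{1j}). \tag{48}$$

Hence, $\mathcal{L} = \mathbb{E}_{q_\phi(z|v)}\left[\sum_{t=1}^{T} l_t\right]$, and $l_t$ is expressed as

$$l_t = \sum_{j=1}^{J} \left( \psi^{(1)}_{jt} z_{jt} - \log(1 + \exp(\psi^{(1)}_{jt})) \right) + \sum_{m=1}^{M} \left( \psi^{(2)}_{mt} v_{mt} - \log(1 + \exp(\psi^{(2)}_{mt})) \right) - \sum_{j=1}^{J} \left( \psi^{(3)}_{jt} z_{jt} - \log(1 + \exp(\psi^{(3)}_{jt})) \right), \tag{49}$$

and we have defined

$$\psi^{(1)}_{jt} = w_{1j}^\top z_{t-1} + b_{1j}, \tag{50}$$

$$\psi^{(2)}_{mt} = w_{5m}^\top h^{g}_t + w_{7m}^\top v_{t-1} + b_{3m}, \tag{51}$$

$$\psi^{(3)}_{jt} = u_{1j}^\top h^{r}_t + u_{2j}^\top z_{t-1} + c_{1j}. \tag{52}$$

The gradients w.r.t. $W_1$, $W_5$, $W_7$, $U_1$ and $U_2$ can be calculated easily. In order to calculate the gradients
w.r.t. $W_3$, $W_4$, $W_6$, $U_4$, $U_5$ and $U_6$, we need to obtain the derivatives of the objective with respect to the deterministic hidden units $h^{g}_{kt}$ and $h^{r}_{kt}$, which can be calculated recursively
via the back-propagation through time algorithm. Specifically, for the generative model we define

$$Q_1 = \sum_{t=1}^{T} \sum_{m=1}^{M} \left( \psi^{(2)}_{mt} v_{mt} - \log(1 + \exp(\psi^{(2)}_{mt})) \right). \tag{53}$$

We observe that $Q_1$ can be computed recursively using

$$Q_t = \sum_{\tau=t}^{T} \sum_{m=1}^{M} \left( \psi^{(2)}_{m\tau} v_{m\tau} - \log(1 + \exp(\psi^{(2)}_{m\tau})) \right) \tag{54}$$

$$\phantom{Q_t} = Q_{t+1} + \sum_{m=1}^{M} \left( \psi^{(2)}_{mt} v_{mt} - \log(1 + \exp(\psi^{(2)}_{mt})) \right), \tag{55}$$

where $Q_{T+1} = 0$. Using the chain rule, we have

$$\frac{\partial Q_t}{\partial h^{g}_{kt}} = \sum_{k'=1}^{K} \frac{\partial Q_{t+1}}{\partial h^{g}_{k',t+1}} \frac{\partial h^{g}_{k',t+1}}{\partial h^{g}_{kt}} + \sum_{m=1}^{M} w_{5mk} (v_{mt} - \sigma(\psi^{(2)}_{mt})) \tag{56}$$

$$\phantom{\frac{\partial Q_t}{\partial h^{g}_{kt}}} = \sum_{k'=1}^{K} \frac{\partial Q_{t+1}}{\partial h^{g}_{k',t+1}} f'(\psi^{(4)}_{k',t+1}) w_{4k'k} + \sum_{m=1}^{M} w_{5mk} (v_{mt} - \sigma(\psi^{(2)}_{mt})), \tag{57}$$

where we have defined

$$\psi^{(4)}_{kt} = w_{3k}^\top z_t + w_{4k}^\top h^{g}_{t-1} + w_{6k}^\top v_{t-1} + b_{2k}, \tag{58}$$

and

$$\frac{\partial Q_T}{\partial h^{g}_{kT}} = \sum_{m=1}^{M} w_{5mk} (v_{mT} - \sigma(\psi^{(2)}_{mT})). \tag{59}$$

The corresponding derivatives with respect to $h^{r}_{kt}$ can be calculated similarly.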
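The backward recursion for the derivative of $Q_t$ with respect to the deterministic hidden state can be sketched as follows. This is a minimal illustration under stated assumptions: $f = \tanh$ is assumed as the nonlinearity, the weight numbering ($W_3$, $W_4$, $W_6$ feeding the hidden state, $W_5$, $W_7$ feeding the observation) follows the reconstruction used here, and all function names, sizes and random values are hypothetical. The recursion is checked against finite differences at one time step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prev_obs(V, t):
    """v_{t-1}, with v_0 defined as the zero vector."""
    return V[t - 1] if t > 0 else np.zeros_like(V[0])

def tail_objective(h_t, t, Z, V, P):
    """Q_t viewed as a function of the hidden state at (0-based) time t."""
    W3, W4, W5, W6, W7, b2, b3 = P
    q, h = 0.0, h_t
    for s in range(t, len(V)):
        if s > t:  # propagate the deterministic state forward in time
            h = np.tanh(W3 @ Z[s] + W4 @ h + W6 @ prev_obs(V, s) + b2)
        psi = W5 @ h + W7 @ prev_obs(V, s) + b3
        q += np.sum(psi * V[s] - np.logaddexp(0.0, psi))
    return q

def bptt(Z, V, P):
    """Forward pass for h_t, then the backward recursion for dQ_t/dh_t."""
    W3, W4, W5, W6, W7, b2, b3 = P
    T, K = len(V), b2.shape[0]
    H, h = np.zeros((T, K)), np.zeros(K)
    for t in range(T):
        h = np.tanh(W3 @ Z[t] + W4 @ h + W6 @ prev_obs(V, t) + b2)
        H[t] = h
    G, g_next = np.zeros((T, K)), np.zeros(K)
    for t in reversed(range(T)):
        psi = W5 @ H[t] + W7 @ prev_obs(V, t) + b3
        g = W5.T @ (V[t] - sigmoid(psi))            # direct term
        if t + 1 < T:                               # recursive term
            # tanh'(a) = 1 - tanh(a)^2, and tanh(a_{t+1}) is H[t + 1]
            g = g + W4.T @ ((1.0 - H[t + 1] ** 2) * g_next)
        G[t], g_next = g, g
    return H, G

rng = np.random.default_rng(0)
T, J, K, M = 4, 2, 3, 5
P = (rng.normal(size=(K, J)), rng.normal(size=(K, K)), rng.normal(size=(M, K)),
     rng.normal(size=(K, M)), rng.normal(size=(M, M)),
     rng.normal(size=K), rng.normal(size=M))        # W3, W4, W5, W6, W7, b2, b3
Z = rng.integers(0, 2, size=(T, J)).astype(float)
V = rng.integers(0, 2, size=(T, M)).astype(float)
H, G = bptt(Z, V, P)

eps, t0 = 1e-6, 1
num = np.array([(tail_objective(H[t0] + eps * e, t0, Z, V, P)
                 - tail_objective(H[t0] - eps * e, t0, Z, V, P)) / (2.0 * eps)
                for e in np.eye(K)])
bptt_err = np.abs(num - G[t0]).max()
```

Given the rows of `G`, the gradients for the weights feeding the hidden state follow by one more application of the chain rule through $f'$.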

Table 7: Average prediction error obtained for
the MIT motion capture dataset.

Model       Pred. Err.
DTSBN-S      3.71 ± 0.03
DTSBN-D      4.19 ± 0.01
TSBN         3.86 ± 0.02
HMSBN       17.49 ± 0.20


D Additional Results 

D.1 Generated Data

The generated synthetic motion capture data and polyphonic music data can be downloaded from
https://drive.google.com/drive/u/0/folders/OBlHR6m3IZSO_SWtOaSloYmlneDQ

D.2 Bouncing balls dataset 

Additional experimental results are shown in Table 6. AR represents an auto-regressive Markov model without
latent variables.


D.3 MIT motion capture dataset 

We randomly select 10% of the dataset as the test set. Quantitative results are shown in Table 7.